AllLife Bank wants to focus on its credit card customer base in the next financial year. Its marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing customers. Another insight from the market research was that customers perceive the support services of the bank poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster.
◎ Personalize marketing campaigns to target both new and existing customers.
◎ Upgrade the service delivery model to ensure that customer queries are resolved faster.
#
# Loading Necessary Libraries
#
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import animation
sns.set(
color_codes=True
) # -----This adds a background color to all the plots created using seaborn
# Allow the use of Display via interactive Python
from IPython.display import display
# Import tabulate, a library used for creating tables in a visually appealing format
from tabulate import tabulate
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist, pdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import (
KElbowVisualizer,
SilhouetteVisualizer,
InterclusterDistance,
)
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# For Dimensionality Reduction
from sklearn.decomposition import PCA
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Making the Python code more structured automatically
%load_ext nb_black
print("Loading Libraries... Done.")
Loading Libraries... Done.
# Loading Dataset
data_path = "Credit+Card+Customer+Data.xlsx"
data = pd.read_excel(data_path)
# Making a copy of the data to avoid any changes to original data
df = data.copy()
print("Loading Dataset... Done.")
Loading Dataset... Done.
# Checking the top 5 and bottom 5
display(df.head()) # -----looking at head (top 5 observations)
display(df.tail()) # -----looking at tail (bottom 5 observations)
| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 87073 | 100000 | 2 | 1 | 1 | 0 |
| 1 | 2 | 38414 | 50000 | 3 | 0 | 10 | 9 |
| 2 | 3 | 17341 | 50000 | 7 | 1 | 3 | 4 |
| 3 | 4 | 40496 | 30000 | 5 | 1 | 1 | 4 |
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |

| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 655 | 656 | 51108 | 99000 | 10 | 1 | 10 | 0 |
| 656 | 657 | 60732 | 84000 | 10 | 1 | 13 | 2 |
| 657 | 658 | 53834 | 145000 | 8 | 1 | 9 | 1 |
| 658 | 659 | 80655 | 172000 | 10 | 1 | 15 | 0 |
| 659 | 660 | 80150 | 167000 | 9 | 0 | 12 | 2 |
Observations
- The Sl_No and Customer Key columns appear to be identifiers rather than informative features.

# -----Print the dimension of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns")
There are 660 rows and 7 columns
# -----Displaying information about features of the Dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB
Observations
- All seven columns are numeric (int64) with 660 non-null entries each.
For Numerical Variables
# -----Displaying Statistical Summary of Numerical Data
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Sl_No | 660.000 | 330.500 | 190.670 | 1.000 | 165.750 | 330.500 | 495.250 | 660.000 |
| Customer Key | 660.000 | 55141.444 | 25627.772 | 11265.000 | 33825.250 | 53874.500 | 77202.500 | 99843.000 |
| Avg_Credit_Limit | 660.000 | 34574.242 | 37625.488 | 3000.000 | 10000.000 | 18000.000 | 48000.000 | 200000.000 |
| Total_Credit_Cards | 660.000 | 4.706 | 2.168 | 1.000 | 3.000 | 5.000 | 6.000 | 10.000 |
| Total_visits_bank | 660.000 | 2.403 | 1.632 | 0.000 | 1.000 | 2.000 | 4.000 | 5.000 |
| Total_visits_online | 660.000 | 2.606 | 2.936 | 0.000 | 1.000 | 2.000 | 4.000 | 15.000 |
| Total_calls_made | 660.000 | 3.583 | 2.865 | 0.000 | 1.000 | 3.000 | 5.000 | 10.000 |
Observations
- The mean Avg_Credit_Limit of a customer is about 35,000, well above the median of 18,000.
- The average Total_Credit_Cards held is about 5.
- Some customers never visit a branch (the minimum Total_visits_bank is zero).
- Some customers never use online banking (the minimum Total_visits_online is zero).
- Some customers never call the bank (the minimum Total_calls_made is zero).

# Checking missing values across each column
c_missing = pd.Series(df.isnull().sum(), name="Missing Count") # -----Count Missing
p_missing = pd.Series(
round(df.isnull().sum() / df.shape[0] * 100, 2), name="% Missing"
) # -----Percentage Missing
# Combine into 1 Dataframe
missing_df = pd.concat([c_missing, p_missing], axis=1)
# # Display missing info
# display(missing_df)
missing_df.sort_values(by="% Missing", ascending=False).style.background_gradient(
cmap="YlOrRd"
)
| | Missing Count | % Missing |
|---|---|---|
| Sl_No | 0 | 0.000000 |
| Customer Key | 0 | 0.000000 |
| Avg_Credit_Limit | 0 | 0.000000 |
| Total_Credit_Cards | 0 | 0.000000 |
| Total_visits_bank | 0 | 0.000000 |
| Total_visits_online | 0 | 0.000000 |
| Total_calls_made | 0 | 0.000000 |
Observations
- There are no missing values in any column.
# Checking for duplicate records
df.duplicated().sum()
0
Observations
- There are no duplicate rows in the dataset.
# checking the number of unique values in each column
df.nunique()
Sl_No                  660
Customer Key           655
Avg_Credit_Limit       110
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
# getting the count for each unique value in Customer Key
filtered = df.groupby("Customer Key").count()
for i in filtered.loc[filtered.Sl_No >= 2].index:
display(df.loc[df["Customer Key"] == i])
| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 48 | 49 | 37252 | 6000 | 4 | 0 | 2 | 8 |
| 432 | 433 | 37252 | 59000 | 6 | 2 | 1 | 2 |

| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 4 | 5 | 47437 | 100000 | 6 | 0 | 12 | 3 |
| 332 | 333 | 47437 | 17000 | 7 | 3 | 1 | 0 |

| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 411 | 412 | 50706 | 44000 | 4 | 5 | 0 | 2 |
| 541 | 542 | 50706 | 60000 | 7 | 5 | 2 | 2 |

| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 391 | 392 | 96929 | 13000 | 4 | 5 | 0 | 0 |
| 398 | 399 | 96929 | 67000 | 6 | 2 | 2 | 2 |

| | Sl_No | Customer Key | Avg_Credit_Limit | Total_Credit_Cards | Total_visits_bank | Total_visits_online | Total_calls_made |
|---|---|---|---|---|---|---|---|
| 104 | 105 | 97935 | 17000 | 2 | 1 | 2 | 10 |
| 632 | 633 | 97935 | 187000 | 7 | 1 | 7 | 0 |
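The groupby-and-loop approach above works, but `duplicated(keep=False)` surfaces all members of each repeated key group in a single expression. A minimal sketch on a small stand-in frame (the column names match the dataset, the values are made up):

```python
import pandas as pd

# Stand-in frame with one repeated Customer Key (37252)
df = pd.DataFrame(
    {
        "Customer Key": [37252, 100, 37252, 200],
        "Avg_Credit_Limit": [6000, 9000, 59000, 12000],
    }
)

# keep=False marks every member of a duplicated group, not just the repeats
dupes = df[df.duplicated("Customer Key", keep=False)].sort_values("Customer Key")
print(dupes)
```

Sorting by the key keeps each pair of records adjacent, which makes the differing attribute values easy to compare.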
Observations
- Customer Key has 655 unique values for 660 rows: five keys each appear on two records. The other attributes of each pair differ, so these rows are not exact duplicates of one another.

User Defined Functions
# -----
# User defined function to plot labeled_barplot
# -----
def labeled_barplot(data, feature, perc=False, v_ticks=True, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    v_ticks: whether to rotate x-axis tick labels vertically (default is True)
    n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
    if v_ticks:
plt.xticks(rotation=90)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
        if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# -----
# User defined function that prints the 5 point summary and histogram, box plot,
# and cumulative density distribution plots
# -----
def summary(data, x):
"""
The function prints the 5 point summary and histogram, box plot,
and cumulative density distribution plots for each
feature name passed as the argument.
    Parameters:
    ----------
    data: dataframe
    x: str, feature name
    Usage:
    ------------
    summary(df, 'age')
    """
x_min = data[x].min()
x_max = data[x].max()
Q1 = data[x].quantile(0.25)
Q2 = data[x].quantile(0.50)
Q3 = data[x].quantile(0.75)
    stats = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}
    ldf = pd.DataFrame(data=stats, index=["Value"])
print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
print(tabulate(ldf, headers="keys", tablefmt="psql"))
fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(16, 22))
sns.set_palette("Pastel1")
    # Histogram (sns.distplot is deprecated; histplot with kde=True is its replacement)
    ax1 = sns.histplot(data[x], kde=True, color="purple", ax=axs[0])
ax1.axvline(np.mean(data[x]), color="purple", linestyle="--")
ax1.axvline(np.median(data[x]), color="black", linestyle="-")
ax1.set_title(f"{x.capitalize()} Density Distribution")
# Boxplot
ax2 = sns.boxplot(
x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True, ax=axs[1]
)
ax2.set_title(f"{x.capitalize()} Boxplot")
    # Cumulative plot
ax3 = sns.kdeplot(data[x], cumulative=True, linewidth=1.5, ax=axs[2])
ax3.set_title(f"{x.capitalize()} Cumulative Density Distribution")
plt.subplots_adjust(hspace=0.4)
plt.show()
# Loop through all the variables of interest and perform a univariate statistical summary
for col in df.columns[2:]:
print("=" * 115, "\n\n")
summary(df, col)
===================================================================================================================

5 Point Summary of Avg_credit_limit Attribute:

+-------+-------+-------+-------+-------+--------+
|       |   Min |    Q1 |    Q2 |    Q3 |    Max |
|-------+-------+-------+-------+-------+--------|
| Value |  3000 | 10000 | 18000 | 48000 | 200000 |
+-------+-------+-------+-------+-------+--------+

===================================================================================================================

5 Point Summary of Total_credit_cards Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     1 |    3 |    5 |    6 |    10 |
+-------+-------+------+------+------+-------+

===================================================================================================================

5 Point Summary of Total_visits_bank Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    4 |     5 |
+-------+-------+------+------+------+-------+

===================================================================================================================

5 Point Summary of Total_visits_online Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    4 |    15 |
+-------+-------+------+------+------+-------+

===================================================================================================================

5 Point Summary of Total_calls_made Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    3 |    5 |    10 |
+-------+-------+------+------+------+-------+
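The skew and outlier remarks that follow can be checked numerically with the usual 1.5 × IQR rule. A minimal sketch on a made-up series (in the notebook this would run on each column of `df`):

```python
import pandas as pd

# Stand-in series; 100 is an obvious outlier relative to the rest
s = pd.Series([1, 2, 3, 4, 5, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
n_outliers = ((s < lower) | (s > upper)).sum()
print(f"{n_outliers} outlier(s) outside [{lower:.1f}, {upper:.1f}]")
```

This is the same fence the boxplot whiskers use, so the count should agree with the points drawn beyond the whiskers above.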
Observations
- Avg_credit_limit is skewed to the right; its mean is greater than the median, and there are outliers on the right.
- The minimum Total_credit_cards is 1 and the maximum is 10; most customers hold 6 Total_credit_cards or fewer.
- Some customers never visit the bank (zero Total_visits_bank). It will be interesting to see if they form a cluster.
- Some customers never bank online (zero Total_visits_online). It will be interesting to see if they form a cluster.
- Total_visits_online is skewed to the right and has outliers.
- Some customers never call the bank (zero Total_calls_made). It will be interesting to see if they form a cluster.
- Total_calls_made is slightly skewed to the right, but there are no outliers.

for col in df.columns[3:]:
labeled_barplot(df, col, perc=True)
Observations
- Customers with zero Total_visits_bank form about 15% of the customer base.
- Customers with zero Total_visits_online form about 21% of the customer base.
- Customers with zero Total_calls_made form about 15% of the customer base.

# Define the Correlation Matrix
cm = df.iloc[:, 2:]
# Display the matrix
plt.figure(figsize=(15, 7))
sns.heatmap(cm.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
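The strongest relationships can also be read off the correlation matrix programmatically by masking the redundant upper triangle and sorting by magnitude. A sketch on a small stand-in frame (the real notebook would use `df.iloc[:, 2:]`):

```python
import numpy as np
import pandas as pd

# Stand-in frame: b is perfectly correlated with a, c is anti-correlated
cm = pd.DataFrame(
    {
        "a": [1, 2, 3, 4, 5],
        "b": [2, 4, 6, 8, 10],
        "c": [5, 3, 4, 1, 2],
    }
)

corr = cm.corr()
# Mask the diagonal and upper triangle so each pair appears once,
# then stack to long form and sort by absolute correlation
mask = np.triu(np.ones(corr.shape, dtype=bool))
pairs = corr.mask(mask).stack().sort_values(key=abs, ascending=False)
print(pairs)
```

The top of the resulting series gives the most strongly (positively or negatively) correlated feature pairs, which is a handy cross-check on what the heatmap shows visually.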
# Create a pairplot
sns.pairplot(data=cm, diag_kind="kde", plot_kws={"color": "purple"})
plt.show()
Observations
- Total_Credit_Cards and Avg_Credit_Limit are positively correlated; that is to be expected.
- Avg_Credit_Limit and Total_visits_online are positively correlated, implying that online banking customers tend to have higher credit limits.
- Total_visits_bank is negatively correlated with Total_visits_online and Total_calls_made; this is to be expected, as customers tend to favor one contact channel.
- Avg_Credit_Limit and Total_calls_made are negatively correlated, meaning that customers who use telephone banking tend to have lower credit limits.

sns.pairplot(
data=df[
[
"Total_visits_bank",
"Total_visits_online",
"Total_calls_made",
"Total_Credit_Cards",
]
],
hue="Total_Credit_Cards",
)
plt.show()
Observations
# Create a 3D Plot that shows the relationship of how the customers connect with the bank
# Create figure and axis
fig = plt.figure(figsize=(10, 10))
ax = plt.axes(projection="3d")
# Initialize data
x = df["Total_visits_bank"]
y = df["Total_visits_online"]
z = df["Total_calls_made"]
# Create a 3D scatter plot
ax.scatter(x, y, z, c=z, cmap="viridis")
# Set labels and title
ax.set_xlabel("Total_visits_bank")
ax.set_ylabel("Total_visits_online")
ax.set_zlabel("Total_calls_made")
ax.set_title("3D Scatter Plot")
# Set the azimuthal angle
ax.view_init(azim=60)
# Display the plot
plt.show()
########################################################################
# The code commented out below was used to generate the gif of the 3D plot.
# The gif was embedded in a markdown cell independently.
########################################################################
########################################################################
# fig = plt.figure(figsize=(10, 10))
# ax = Axes3D(fig)
# # Initialize data
# x = df["Total_visits_bank"]
# y = df["Total_visits_online"]
# z = df["Total_calls_made"]
# g = ax.scatter(x, y, z, c=x, marker="o", depthshade=False, cmap="Paired")
# ax.set_xlabel("Total Bank Visits")
# ax.set_ylabel("Total Visits Online")
# ax.set_zlabel("Total Calls Made")
# # produce a legend with the unique colors from the scatter
# legend = ax.legend(
# *g.legend_elements(),
# loc="lower center",
# title="Total bank visits",
# borderaxespad=-10,
# ncol=4
# )
# ax.add_artist(legend)
# def rotate(angle):
# ax.view_init(azim=angle)
# angle = 1
# ani = animation.FuncAnimation(fig, rotate, frames=np.arange(0, 360, angle), interval=1)
# ani.save("Cluster_plot.gif", writer=animation.PillowWriter(fps=25))
########################################################################
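The commented-out snippet above uses the `Axes3D(fig)` constructor, which no longer attaches the axes to the figure on recent Matplotlib versions. A sketch of the same rotation animation with the current `add_subplot` API, using random stand-in data in place of the dataframe columns (the `save` call is left commented, as in the original):

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
from matplotlib import animation

# Random stand-in data for the three channel-usage columns
rng = np.random.default_rng(0)
x = rng.integers(0, 6, 60)
y = rng.integers(0, 16, 60)
z = rng.integers(0, 11, 60)

fig = plt.figure(figsize=(6, 6))
ax = fig.add_subplot(projection="3d")  # replaces the deprecated Axes3D(fig)
ax.scatter(x, y, z, c=z, cmap="viridis")
ax.set_xlabel("Total_visits_bank")
ax.set_ylabel("Total_visits_online")
ax.set_zlabel("Total_calls_made")

def rotate(angle):
    # Each frame just changes the azimuthal viewing angle
    ax.view_init(azim=angle)

ani = animation.FuncAnimation(fig, rotate, frames=range(0, 360, 30), interval=50)
# ani.save("Cluster_plot.gif", writer=animation.PillowWriter(fps=25))
```

Uncommenting the `save` line writes the rotating gif via Pillow, mirroring the original workflow.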